ABSTRACT

In developing tests for the International Association
for the Evaluation of Educational Achievement (IEA) survey,
methodological problems arose in three areas: curriculum,
communication, and culture. Efforts to identify the core of common
objectives and the penumbra of distinctive, sometimes partly shared
but sometimes unique, goals operated through a system of national and
international committees. Each country was given the responsibility
of assembling a national committee, each having the task of preparing
a national blueprint of content and process objectives that would be
appropriate at the specified age or grade levels in that country.
Through interaction with national and international committees, items
were selected, edited, and assembled into preliminary forms for
try-out. Communication was a problem in maintaining the flow of
information, materials, and actions out to the participating
countries and back to the central coordinating office. In a more
specific sense, communication was a problem in the domain of language
and translation. Problems in the area of culture were partly semantic
and partly a matter of picking a set of quantitative alternatives
giving good differentiation between countries. (Author/RC)

If I had to produce a capsule summary of our methodological
problems in developing the IEA instruments, I would say, "Curriculum,
Communication and Culture." Let me expand on this to provide
clarification and substance.

Whenever a test is to be given to evaluate educational achievement,
it is important that the test tasks match the learning outcomes
that are set as objectives of the instructional program that is being
evaluated. This is the familiar notion of content validity drummed
into every student in his introductory testing course. It gets fancied
up with lists of behavioral objectives and criterion references,
but it is still the ancient maxim of "test what you teach."

Achieving a precise match between instructional objectives and
test tasks presents problems even within a country if there is a
degree of decentralization and diversity, as there emphatically is in
the U.S.A. What is the main theme in one social studies program, for
example, may be perceived as peripheral or even irrelevant in another.
But the diversity seems likely to be compounded if one deals with 10
or 15 or 20 countries. How shall one deal with that diversity?

The problem has two sides: (1) How shall one determine the
dimensions of the diversity? (2) Having identified the commonality and
the diversity of objectives in different countries, how is one to deal
with what one finds?

In the IEA studies, our efforts to identify the core of common
objectives and the penumbra of distinctive, sometimes partly shared
but sometimes unique goals operated through a system of national and
international committees. Each participating country was given the
responsibility of assembling a national committee, presumably well
versed in the curriculum of math or science or reading instruction in
that country. Each national committee had the task of preparing a
national blueprint of content and process objectives that would be
appropriate at the specified age or grade levels in that country. The
national outlines were to be fed in to a central international subject
matter committee that had the responsibility of collating them,
identifying areas of agreement and areas of divergence, and then proposing
a composite international blueprint. This was then returned to the
national committees for review, criticism, and suggestions for
modification. With varying amounts of interaction back and forth, the
content by process blueprint was stabilized in a final form.

The same type of reciprocal interaction was to take place in the
preparation of test exercises. That is, the national committees were
invited to submit possible exercises to an item pool, and these were
reviewed by the central international committee. A selection of
possible items was made, and these were sent back to the national centers
for review and comment. In the light of such comments as were received,
items were selected, edited and assembled into preliminary forms for
try-out.

This, at least, is how things operated in theory. But if you
know anything about humankind, you know that national centers varied
widely in the promptness and in the meticulousness with which they
responded to requests for materials or for reactions to materials. Thus,
inputs from national centers tended to be spotty, with some having much
more influence than others on the final product, and a disproportionate
share of the determination of what appeared in the final tests fell
upon the central international subject committees. The logistical
problems of maintaining an effectively functioning, world-wide
communication network for a project of this sort are very severe indeed.

One strategy would say: Build a separate test for each country,
to match that country's objectives. This is a conceivable strategy
if one thinks of countries solely as opportunities to replicate in
different settings some strictly intra-national types of analysis. If,
for example, one wanted to study in a number of countries relationships
of sex of teacher and sex of student to mathematics achievement
(assuming that this were a problem worth studying), it would not seem
important to use the same identical math test in each country. Different
tests, each tailored to the objectives of the specific country, would
seem to provide legitimate evidence on a problem such as this. It is
possible that the specific content of the test would interact with sex
of teacher and student, but it seems unlikely. However, if the
enterprise is concerned in part with comparing the levels of achievement
reached in different countries, there would seem to be no way to do
this except through a common set of test tasks. What, then, should be
the specifications for these tasks? At the two extremes, they might
be either (1) limited to tasks that correspond to objectives espoused
by all countries or (2) extended to include all objectives espoused
by any country. An intermediate position would be to plan to assess
objectives agreed to by several but not all participating countries.
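
The three positions just described can be pictured as thresholds on
the number of countries that endorse an objective. The following is
only an illustrative sketch, not a record of how the IEA committees
actually worked: the function name `blueprint`, the country labels,
and the objective lists are all invented for the example.

```python
from collections import Counter

def blueprint(national_objectives, min_countries):
    """Return the objectives endorsed by at least `min_countries` countries."""
    counts = Counter(
        objective
        for objectives in national_objectives.values()
        for objective in set(objectives)
    )
    return {objective for objective, n in counts.items() if n >= min_countries}

# Hypothetical national blueprints, invented purely for illustration.
national = {
    "Country A": {"arithmetic", "algebra", "geometry"},
    "Country B": {"arithmetic", "algebra"},
    "Country C": {"arithmetic", "geometry", "probability"},
}

universal = blueprint(national, len(national))  # espoused by all countries
inclusive = blueprint(national, 1)              # espoused by any country
intermediate = blueprint(national, 2)           # espoused by several, not all

print(sorted(universal))
print(sorted(intermediate))
print(sorted(inclusive))
```

Raising or lowering the threshold trades narrowness of the test
against the number of topics on which some students have had no
instruction.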

No one of these choices is ideal. Limiting the assessment to
universal objectives is likely to produce an excessively narrow test,
and one that is least adequate for the system with the most inclusive
curriculum. Including the complete range of objectives implies
testing students in some countries on many topics on which they have had
no instruction. An intermediate stage represents a compromise between
these two ills, but not the elimination of either of them.
Incidentally, I believe that this compromise solution is the one that was
adopted in most of the cases. It is also my impression that the
situation was not quite as desperate as I have made it sound, since in
large part the content and objectives in mathematics or science or
reading were common across countries. A further adaptation to the
differences that clearly did exist in balance and emphasis was to provide
part scores and item statistics, so that a country's achievement could
be compared with the others not merely on total mathematics score, for
example, but on arithmetic, algebra and geometry, or on computational
skills vs. problem solving. National profile patterns were in some ways
more instructive than national standing on the "educational Olympics."
One final adaptation was to get in each country estimates of how
commonly students had been taught the content covered by each item, and
to use this measure of "opportunity to learn" as one independent
variable in a number of analyses.



My second key term was "communication." This was a problem in
two quite different senses. One I have already alluded to. This was
the logistic problem of maintaining the flow of information, materials,
and actions out to the participating countries and back to the central
coordinating office of the project. It is hard enough to try to keep
a single national survey, directed out of a single national
headquarters, operating smoothly and on schedule. Adding an additional layer
of coordination on top of this, with additional flow of information and
materials back and forth across oceans and continents at each step in
the way, makes maintenance of an established schedule of operations
almost impossible of fulfillment. We learned of the difficulties as we
went along: of floods in Hungary and epidemics in Aberdeen, of
mark-sense cards lost in transit or swallowed up by Customs, of
well-intentioned national centers that never did get the try-out booklets
administered. We came to realize the absolutely vital importance of a
strong international office, with a compulsive administrator to monitor
the flow of information and material.

In the most recent cycle of studies, we adopted the strategy of
having in each country a nearly full-time National Technical Officer,
who provided the responsible dynamic within the country to meet
commitments and deadlines. We were impressed with the necessity of spelling
out all procedures and schedules in operating manuals that were
infinitely detailed. We came to rely upon intensive week-long briefing
sessions of the National Technical Officers at which all procedures were
reviewed and even the most minor details worked out. But even so,
participation in planning and review was spotty, and we still had one or
two instances in which operational slippage occurred: such an unhappy
event as an item being mis-keyed, or a country testing fifth graders
instead of 10-year-olds.

The other sense in which "communication" was a problem was more
specifically in the domain of language. In the survey of achievement
in science, in which we had the greatest number of participating
countries, it was necessary to translate all materials into 14 different
languages ranging from Finnish to Japanese. The translation was
required not only for the tests but also for questionnaires for students,
teachers and school officials, and in addition all the manuals and
procedural guides that directed the work of the coordinator in a school
system and the test administrators who actually carried out the
testing. It was a horrendous task!

At this point the question arises: How adequate was the
translation? Did a given test exercise present the same task after
translation into each of the languages? Did the background questionnaires
present in all essential respects the same questions to children or
teachers in each country? How does one know? I should note in
passing that English was the common language through which everything
passed on its way to the other languages. That is, if the Finnish
National Center contributed a biology item, it was translated from Finnish
into English before being translated into Italian, Japanese, Hindi,
Thai and all the others.

It is perhaps for the reading tests that one becomes most
concerned with problems of translation, since in these tests language
appears to be of the essence. What evidence can one present that the
test task has not been subtly or even grossly distorted by the process
of translation?

Our original hope had been to get an immediate and independent
back translation of all of the passages and items, and to use this to
police any distortions that might seem to have crept in. Alas, neither
time nor resources of translators were available to make this possible.
We do have back translations of selected passages, together with their
items, but these were received after the fact, and could not be used
to make any modifications of the tests.

Two lines of evidence from prior studies had led us to believe
that translation problems might not be too serious. One has to do
with the consistency of relative item difficulty from one language to
another. We had included a little reading test in our initial pilot
study reported in 1962. In this study the correlation from language
to language of item difficulties, expressed as percent getting the item
right, was 0.90 and this high correlation seemed to suggest that each
item maintained its character with little change under translation. A
second line of evidence comes from a Teachers College doctoral
dissertation studying the possibility of using the combination of a reading
test in English and one in the native language (in this case Turkish)
as a basis for appraising both scholastic aptitude and degree of
mastery of English of foreign students who might come for college studies
in the U.S.A. The cross-language difficulty indices didn't correlate
as well in this case, about 0.70, but a back translation was produced.
In this study no significant differences in difficulty were found in
mean scores on the original and the re-translated versions of the tests
when given to high school students in the U.S.A. For one form, the
correlation of item difficulties between original and re-translated
form (corrected for the unreliability of the indices) was 0.95, while
for the other form it was 0.77. Thus, the items and tests did not
seem to have been too badly distorted by translation into Turkish and
back again.
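
For readers who want to see the arithmetic behind such figures, the
sketch below computes a correlation between two sets of item
difficulties (percent correct per item) and then applies the standard
correction for attenuation, r / sqrt(rel_x * rel_y). It is offered as
an assumption about the general kind of calculation involved, not as
the procedure of either study cited above; the function names, the
percent-correct values, and the reliabilities are all invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / sqrt(rel_x * rel_y)

# Hypothetical percent-correct values for six items in two language versions.
original = [82, 74, 66, 51, 43, 30]
retranslated = [79, 70, 69, 48, 47, 28]

observed = pearson(original, retranslated)
# Hypothetical reliabilities of the two sets of difficulty indices.
corrected = disattenuate(observed, rel_x=0.95, rel_y=0.95)
print(round(observed, 2), round(corrected, 2))
```

Because the correction divides by a quantity no larger than 1, the
corrected value is never smaller in magnitude than the observed one,
and with unreliable indices it can exceed 1.0, which is a reason to
read corrected coefficients with some caution.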

So we went ahead and translated the materials not only for the
tests of Mathematics, Science and Civic Education, but also the
passages used to measure reading comprehension and literary comprehension
and appreciation. It is only for the Reading Comprehension Test that
I have had a chance to examine the consistency of item statistics from
language to language. Alas, the correlations are not as high as those
that we found in our pilot study. The average cross-language
correlations of item difficulty were approximately 0.75 for 10-year-olds,
0.70 for 14-year-olds and 0.65 at the end of secondary school. For
item discrimination indices the corresponding correlations were about
0.60, 0.40 and 0.45.

The results suggest that maintaining comparability under
translation becomes a progressively more serious problem as the material to
be translated becomes more difficult. This is perhaps not surprising.
It may arise from either or both of two influences. On the one hand,
simple ideas and simple items may have more exact counterparts in
other languages. On the other, simple materials place less of a strain
upon the cognitive and linguistic skills of the translators. Thus, the
most difficult passages were ones that had in the past been used as
part of an admissions test for doctoral students at Teachers College.
It would not be surprising if even a very capable Iranian educator, for
example, whose native language was not English, had difficulty in
rendering precisely in Persian a passage on the philosophy of science or
the determination of gross national product. I have a sneaking
suspicion that reading a back-translation for a few of the most difficult
passages, if they had been prepared, would have been a somewhat gruesome
experience.

We attempted to carry out a scrutiny of those items in which
certain countries showed sharply deviating responses, deviating especially
on the error choices that they selected. Our effort was to understand
why the discrepancies arose. We asked the National Technical Officer in
each country to give a rationale for each of the peculiarities of
response in his country. We asked him to try to judge whether the
peculiarity arose from some idiosyncrasy of the national language or from
some idiosyncrasy of the national culture. But the effort wasn't
very productive. The judges expressed very great difficulty in making
the judgments, and the rationalizations that they offered were
singularly unconvincing. The only really convincing explanation arose in
one or two instances in which they had reversed the order of the
options, or made an error in the scoring key.



Mention of culture brings us to the third potential problem in
preparing instruments for use in various countries. Are the tests,
and especially the questionnaires, suited to the culture of each of
the countries involved? For example, one reading passage concerned
Ernenek, an Eskimo boy, who lived in a snow igloo on the top of the
world and "iced" the runners of his sledge to make them slide better
on the ice and snow. How does a passage of this type perform in
Finland and Sweden on the one hand, which were the most nearly arctic
of our countries, and the Netherlands and Chile on the other, where
it is unlikely that anything remotely resembling an Eskimo or a sledge
has ever been seen? It is comforting to find that Finland and Sweden
do relatively no better on this passage than others, and the
Netherlands and Chile relatively no worse. I have not made a systematic
check passage by passage, and this should probably be done to see
whether national variations on specific items are peculiar to the
item, or reflect something more general about the passage as a whole.

On the questionnaires, some problems arose relating to the
wording of the questions. However, the major difficulties centered on the
response options. In order to keep the data reduction within manageable
limits, every effort was made to pre-code the options on the
questionnaire completed by the students, teachers and a school administrator.
A given response option needed to be uniform across all countries if
the data were to be reduced to alphabetic or numerical codes,
consolidated within countries and compared across countries. But in
preparing these codes two types of problems were encountered. These
will be illustrated by some fairly representative examples.

The first type of problem was semantic. Consider the question:
"Which of the following best characterizes the community served by
this school?" The alternatives in the English version are various
combinations of "urban," "suburban," and "rural." It seems likely
that "urban" and "rural" will have fairly uniform meaning, but are
"suburbs" as we think of them a meaningful concept in all cultures?
Or again, in a question about the amount of training in physics that
a science teacher has had, how does "between 2 and 4 semesters"
convert into the training programs in England, or Hungary, or Iran, to
say nothing of a U.S. university on the quarter system?

The second type of problem relates to picking a set of
quantitative alternatives that gives good differentiation between countries.
This can be illustrated by the question: "How many books are there
in your home?" Response categories ranged from a low of "None" to a
high of "More than 50." These options worked well in countries such
as Chile and India, but in Sweden about 80 percent of the respondents
marked the highest category, and there was, as a result, very little
spread across the group of Swedish respondents.
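
A simple try-out check for this kind of ceiling is to tabulate, country
by country, the share of respondents falling in the top category. The
sketch below is a hypothetical illustration only: the function name,
the 0.5 threshold, the intermediate category labels, and the response
counts are assumptions made for the example, with only the 80 percent
figure echoing the Swedish case described above.

```python
def top_category_share(responses, top):
    """Fraction of responses that fall in the highest category."""
    return sum(1 for r in responses if r == top) / len(responses)

def has_ceiling(responses, top, threshold=0.5):
    """Flag a question whose top category absorbs more than `threshold`."""
    return top_category_share(responses, top) > threshold

TOP = "More than 50"

# Invented response sets of 100 students each, for illustration only.
ceiling_country = [TOP] * 80 + ["11-50"] * 15 + ["1-10"] * 4 + ["None"] * 1
spread_country = [TOP] * 20 + ["11-50"] * 30 + ["1-10"] * 35 + ["None"] * 15

print(top_category_share(ceiling_country, TOP), has_ceiling(ceiling_country, TOP))
print(top_category_share(spread_country, TOP), has_ceiling(spread_country, TOP))
```

A question flagged in this way at the try-out stage would need higher
categories added if it is to separate respondents in that country.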

Of course, all the questionnaires encountered the full range of
problems that plague questionnaire and survey studies within a country.
Options appeared not to be applicable in individual cases. Many
schools appeared to have only impressionistic data on expenditures
within their school. One may question the accuracy of student responses
to questions about parental occupation and education, though some
preliminary studies indicated that pretty good correspondence was obtained
between student and parent reports. These internal problems become
accentuated by the difficulties in maintaining equivalence of meaning
across languages and cultures. Thus, relationships (or the lack of
them) between family and school factors and the dependent variables of
school achievement need to be scrutinized critically by the researcher
in the country involved to examine the possibility that unexpected
results may represent some deficiency in the instrument, rather than
a genuine peculiarity of the particular educational system.

In my presentation I have focussed on the methodological problems.
Obviously, we have felt that we have arrived at tolerable solutions to
these problems, though far from ideal ones, because we did proceed
with the study. But reviewers of the findings must remember that this
is a large scale survey type of study, with all the limitations in
types of data and integrity of the results that this implies, and that
in a cross-national study these limitations are doubled in spades.